Objectives

  • Use principal component analysis and factor analysis to investigate the underlying patterns of the dataset and identify the factors that are highly correlated to the satisfaction level of US airline passengers.
  • Predict customers’ level of satisfaction with linear discriminant analysis and compare the result with step-wise regression and logistic regression.
  • Visualize new components/factors and compare them across different passengers’ profile.

The Dataset

Source of data: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction

This dataset contains an airline passenger satisfaction survey.

Variable Description Values
PASSENGERS INFORMATION
Satisfaction (target) Airline satisfaction level satisfied, neutral, or dissatisfied
Gender Gender of the passengers Female, Male
Customer.Type The customer type Loyal customer, disloyal Customer
Age The actual age of the passengers
Type.of.Travel Purpose of the flight of the passengers Personal Travel, Business Travel
Class Travel class in the plane of the passengers Business, Eco, Eco Plus
FLIGHTS INFORMATION
Flight.Distance The flight distance of this journey
Departure.Delay.in.Minutes Minutes delayed when depart
Arrival.Delay.in.Minutes Minutes delayed when arrive
SERVICE RATINGS
Inflight.wifi.service Satisfaction level of the inflight wifi service 0: Not Applicable; 1 - 5
Departure.Arrival.time.convenient Satisfaction level of Departure/Arrival time convenient 0: Not Applicable; 1 - 5
Ease.of.Online.booking Satisfaction level of online booking 0: Not Applicable; 1 - 5
Gate.location Satisfaction level of gate location 1 - 5
Food.and.drink Satisfaction level of food and drink 0: Not Applicable; 1 - 5
Online.boarding Satisfaction level of online boarding 0: Not Applicable; 1 - 5
Seat.comfort Satisfaction level of seat comfort 1 - 5
Inflight.entertainment Satisfaction level of inflight entertainment 0: Not Applicable; 1 - 5
On-board.service Satisfaction level of on-board service 0: Not Applicable; 1 - 5
Leg.room.service Satisfaction level of leg room service 0: Not Applicable; 1 - 5
Baggage.handling Satisfaction level of baggage handling 1 - 5
Checkin.service Satisfaction level of Check-in service 1 - 5
Inflight.service Satisfaction level of inflight service 0: Not Applicable; 1 - 5
Cleanliness Satisfaction level of cleanliness 0: Not Applicable; 1 - 5

Exploratory Data Analysis

# overview of data attributes
str(air)
## 'data.frame':    25976 obs. of  24 variables:
##  $ id                               : int  19556 90035 12360 77959 36875 39177 79433 97286 27508 62482 ...
##  $ Gender                           : chr  "Female" "Female" "Male" "Male" ...
##  $ Customer.Type                    : chr  "Loyal Customer" "Loyal Customer" "disloyal Customer" "Loyal Customer" ...
##  $ Age                              : int  52 36 20 44 49 16 77 43 47 46 ...
##  $ Type.of.Travel                   : chr  "Business travel" "Business travel" "Business travel" "Business travel" ...
##  $ Class                            : chr  "Eco" "Business" "Eco" "Business" ...
##  $ Flight.Distance                  : int  160 2863 192 3377 1182 311 3987 2556 556 1744 ...
##  $ Inflight.wifi.service            : int  5 1 2 0 2 3 5 2 5 2 ...
##  $ Departure.Arrival.time.convenient: int  4 1 0 0 3 3 5 2 2 2 ...
##  $ Ease.of.Online.booking           : int  3 3 2 0 4 3 5 2 2 2 ...
##  $ Gate.location                    : int  4 1 4 2 3 3 5 2 2 2 ...
##  $ Food.and.drink                   : int  3 5 2 3 4 5 3 4 5 3 ...
##  $ Online.boarding                  : int  4 4 2 4 1 5 5 4 5 4 ...
##  $ Seat.comfort                     : int  3 5 2 4 2 3 5 5 5 4 ...
##  $ Inflight.entertainment           : int  5 4 2 1 2 5 5 4 5 4 ...
##  $ On.board.service                 : int  5 4 4 1 2 4 5 4 2 4 ...
##  $ Leg.room.service                 : int  5 4 1 1 2 3 5 4 2 4 ...
##  $ Baggage.handling                 : int  5 4 3 1 2 1 5 4 5 4 ...
##  $ Checkin.service                  : int  2 3 2 3 4 1 4 5 3 5 ...
##  $ Inflight.service                 : int  5 4 2 1 2 2 5 4 3 4 ...
##  $ Cleanliness                      : int  5 5 2 4 4 5 3 3 5 4 ...
##  $ Departure.Delay.in.Minutes       : int  50 0 0 0 0 0 0 77 1 28 ...
##  $ Arrival.Delay.in.Minutes         : num  44 0 0 6 20 0 0 65 0 14 ...
##  $ satisfaction                     : chr  "satisfied" "satisfied" "neutral or dissatisfied" "satisfied" ...
# descriptive summary of the data
summary(air)
##        id            Gender          Customer.Type           Age       
##  Min.   :    17   Length:25976       Length:25976       Min.   : 7.00  
##  1st Qu.: 32170   Class :character   Class :character   1st Qu.:27.00  
##  Median : 65320   Mode  :character   Mode  :character   Median :40.00  
##  Mean   : 65006                                         Mean   :39.62  
##  3rd Qu.: 97584                                         3rd Qu.:51.00  
##  Max.   :129877                                         Max.   :85.00  
##                                                                        
##  Type.of.Travel        Class           Flight.Distance Inflight.wifi.service
##  Length:25976       Length:25976       Min.   :  31    Min.   :0.000        
##  Class :character   Class :character   1st Qu.: 414    1st Qu.:2.000        
##  Mode  :character   Mode  :character   Median : 849    Median :3.000        
##                                        Mean   :1194    Mean   :2.725        
##                                        3rd Qu.:1744    3rd Qu.:4.000        
##                                        Max.   :4983    Max.   :5.000        
##                                                                             
##  Departure.Arrival.time.convenient Ease.of.Online.booking Gate.location  
##  Min.   :0.000                     Min.   :0.000          Min.   :1.000  
##  1st Qu.:2.000                     1st Qu.:2.000          1st Qu.:2.000  
##  Median :3.000                     Median :3.000          Median :3.000  
##  Mean   :3.047                     Mean   :2.757          Mean   :2.977  
##  3rd Qu.:4.000                     3rd Qu.:4.000          3rd Qu.:4.000  
##  Max.   :5.000                     Max.   :5.000          Max.   :5.000  
##                                                                          
##  Food.and.drink  Online.boarding  Seat.comfort   Inflight.entertainment
##  Min.   :0.000   Min.   :0.000   Min.   :1.000   Min.   :0.000         
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000         
##  Median :3.000   Median :4.000   Median :4.000   Median :4.000         
##  Mean   :3.215   Mean   :3.262   Mean   :3.449   Mean   :3.358         
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.000         
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000         
##                                                                        
##  On.board.service Leg.room.service Baggage.handling Checkin.service
##  Min.   :0.000    Min.   :0.00     Min.   :1.000    Min.   :1.000  
##  1st Qu.:2.000    1st Qu.:2.00     1st Qu.:3.000    1st Qu.:3.000  
##  Median :4.000    Median :4.00     Median :4.000    Median :3.000  
##  Mean   :3.386    Mean   :3.35     Mean   :3.633    Mean   :3.314  
##  3rd Qu.:4.000    3rd Qu.:4.00     3rd Qu.:5.000    3rd Qu.:4.000  
##  Max.   :5.000    Max.   :5.00     Max.   :5.000    Max.   :5.000  
##                                                                    
##  Inflight.service  Cleanliness    Departure.Delay.in.Minutes
##  Min.   :0.000    Min.   :0.000   Min.   :   0.00           
##  1st Qu.:3.000    1st Qu.:2.000   1st Qu.:   0.00           
##  Median :4.000    Median :3.000   Median :   0.00           
##  Mean   :3.649    Mean   :3.286   Mean   :  14.31           
##  3rd Qu.:5.000    3rd Qu.:4.000   3rd Qu.:  12.00           
##  Max.   :5.000    Max.   :5.000   Max.   :1128.00           
##                                                             
##  Arrival.Delay.in.Minutes satisfaction      
##  Min.   :   0.00          Length:25976      
##  1st Qu.:   0.00          Class :character  
##  Median :   0.00          Mode  :character  
##  Mean   :  14.74                            
##  3rd Qu.:  13.00                            
##  Max.   :1115.00                            
##  NA's   :83

There are several points to note from our descriptive analysis so far:

  • The ranking features don’t all share the same scale. This is because in these features, 0 denotes that the services were not offered on the flights, therefore not applicable for the customer to rank. With features where the ranking ranges from 1 - 5, it can be understood that the services were offered on all flights.

    • 0 - 5 : Inflight.wifi.service, Departure.Arrival.time.convenient, Ease.of.Online.booking, Food.and.drink, Online.boarding, Inflight.entertainment, On.board.service, Leg.room.service, Inflight.service, Cleanliness

    • 1 - 5 : Gate.location, Seat.comfort, Baggage.handling, Checkin.service

  • Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes have entries of 0 in them. These entries do not represent missing values, but rather that flights did not experience any delays in arrival or departure.

  • There are 83 NA values in Arrival.Delay.in.Minutes.

Data Cleaning

To avoid any bias and misinterpretation of data, we will check for missing values in the dataset. If missing values exist, we will look further into each column.

# visualize missing data
vis_miss(air) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

# total number of missing values
cat("\nTotal number of missing values:", sum(is.na(air)))
## 
## Total number of missing values: 83
# percentage of missing data
cat("\nPercentage of missing data:", (sum(is.na(air))/nrow(air))*100, "%")
## 
## Percentage of missing data: 0.3195257 %

The visualization of missing values for the entire dataset confirms our observation earlier, that there are only NA values in Arrival.Delay.in.Minutes. Since they constitute only 0.32% of the entire dataset, it is safe to remove all rows with NA values.

# drop missing values
air <- air %>% drop_na()
cat("Number of missing values:", sum(is.na(air)))
## Number of missing values: 0

We also want to check if there are duplicated records in the dataset. Since ID is the unique key of the dataset, we can use ID to check and identify if duplication exists.

# check for duplicates
sum(duplicated(air$id))
## [1] 0

Data Variation

Response Variable

# count and percentage of each categories in satisfaction
satisfaction_count <- as.data.frame(table(air$satisfaction))
satisfaction_count$percent <- satisfaction_count$Freq / sum(satisfaction_count$Freq) * 100

# pie chart
ggplot(satisfaction_count, aes(x = '', y = Freq, fill = factor(Var1))) +
  geom_bar(stat = 'identity', width = 1) +
  coord_polar(theta = 'y') +
  theme_void() +
  scale_fill_manual(values = c('#3F7ED5', '#C7DFF9'), name = '') +
  labs(title = 'Pie chart of satisfaction') +
  theme(legend.position = 'right',
        legend.text = element_text(size = 10),
        plot.title = element_text(face = 'bold', hjust = 0.5)) +
  geom_text(aes(label = paste0(round(percent, 1), '%')),
            position = position_stack(vjust = 0.5))

Categorical Variables

# name of categorical columns
cat_cols <- colnames(air[c('Class', 'Customer.Type', 'Gender', 'Type.of.Travel')])

for (col in cat_cols) {
  # a data frame with the count of each category in the current column
  cat_count <- data.frame(table(air[[col]]))

  # a bar plot of the count data
  print(ggplot(cat_count, aes(x = Var1, y = Freq)) +
    geom_bar(stat = 'identity', fill = '#3F7ED5') +
    geom_text(aes(label = Freq), vjust = -0.5) +
    theme(plot.title = element_text(face = 'bold'),
          axis.ticks = element_blank()) +
    labs(title = col, x = '', y = "Count"))
}

Numerical Variables

# name of numerical columns
num_cols <- air %>%
  select_if(function(x) length(unique(x)) > 6) %>%
  colnames

# histogram for each numerical column
for (col in num_cols) {
  print(ggplot(air, aes(x = .data[[col]])) +
  geom_histogram(binwidth = 0.5, color = '#3F7ED5')+
  labs(title = col, x = '', y = 'Frequency') +
  theme(axis.ticks = element_blank(),
        plot.title= element_text(face = "bold")))
}

Rank Variables

# name of columns with rank data
rank_cols <- air %>%
  select_if(~ is.numeric(.) && n_distinct(.) <= 6) %>%
  colnames()

# reshape data from wide to long format
rank_long <- air %>%
  pivot_longer(cols = all_of(rank_cols),
                names_to = 'column_name',
                values_to = 'rank')

# group data by column name and rank
# calculate the count of each rank in each column
rank_counts <- rank_long %>%
  group_by(column_name, rank) %>%
  summarise(count = n(), .groups = 'drop')
# plot stacked bars
ggplot(rank_counts, aes(x = column_name, y = count, fill = factor(rank))) +
  geom_bar(stat = 'identity') +
  scale_fill_manual(values = c('#C7DFF9', '#A4C8F0', '#7FADEB', '#5E95E0', '#3F7ED5', '#2B5FA6'),
                    name = 'Rank',
                    labels = c('0', '1', '2', '3', '4', '5')) +
  labs(x = '', y = 'Count', fill = "Rank") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.ticks = element_blank(),
        panel.background = element_blank())

for (col in rank_cols) {
  # a data frame with the count of each category in the current column
  rank_count <- data.frame(table(air[[col]]))

  # a bar plot of the count data
  print(ggplot(rank_count, aes(x = Var1, y = Freq)) +
    geom_bar(stat = 'identity', fill = '#3F7ED5') +
    geom_text(aes(label = Freq), vjust = -0.5) +
    theme(plot.title = element_text(face = 'bold'),
          axis.ticks = element_blank()) +
    labs(title = col, x = '', y = "Count"))
}

Correlation

# correlation matrix
corr_matrix <- correlate(air[,c(4, 7:23)], diagonal = 1)
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
corr_matrix
## # A tibble: 18 × 19
##    term       Age Flight…¹ Inflig…² Depart…³ Ease.o…⁴ Gate.l…⁵ Food.a…⁶ Online…⁷
##    <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1 Age    1        9.99e-2  0.00912  3.19e-2  1.37e-2  2.68e-3  0.0246   0.203  
##  2 Flig…  0.0999   1   e+0  0.00460 -1.49e-2  6.20e-2  7.95e-3  0.0578   0.215  
##  3 Infl…  0.00912  4.60e-3  1        3.49e-1  7.11e-1  3.48e-1  0.122    0.459  
##  4 Depa…  0.0319  -1.49e-2  0.349    1   e+0  4.40e-1  4.59e-1 -0.0153   0.0809 
##  5 Ease…  0.0137   6.20e-2  0.711    4.40e-1  1   e+0  4.66e-1  0.0255   0.408  
##  6 Gate…  0.00268  7.95e-3  0.348    4.59e-1  4.66e-1  1   e+0 -0.00941  0.00706
##  7 Food…  0.0246   5.78e-2  0.122   -1.53e-2  2.55e-2 -9.41e-3  1        0.230  
##  8 Onli…  0.203    2.15e-1  0.459    8.09e-2  4.08e-1  7.06e-3  0.230    1      
##  9 Seat…  0.155    1.59e-1  0.116   -2.01e-3  2.30e-2 -1.05e-3  0.582    0.416  
## 10 Infl…  0.0695   1.38e-1  0.201   -2.21e-2  4.46e-2 -4.78e-4  0.627    0.279  
## 11 On.b…  0.0552   1.18e-1  0.114    6.08e-2  4.02e-2 -3.14e-2  0.0513   0.150  
## 12 Leg.…  0.0330   1.37e-1  0.160    3.37e-3  1.17e-1 -2.28e-3  0.0364   0.121  
## 13 Bagg… -0.0505   7.12e-2  0.119    6.56e-2  4.07e-2 -4.20e-3  0.0378   0.0846 
## 14 Chec…  0.0258   7.53e-2  0.0465   8.29e-2  4.58e-4 -5.46e-2  0.0773   0.204  
## 15 Infl… -0.0593   6.61e-2  0.109    6.80e-2  3.55e-2 -5.24e-3  0.0407   0.0723 
## 16 Clea…  0.0489   1.06e-1  0.125   -7.07e-3  1.10e-2 -1.42e-2  0.660    0.321  
## 17 Depa… -0.00545  2.40e-3 -0.0104  -2.25e-4 -1.44e-3  7.62e-3 -0.0267  -0.0225 
## 18 Arri… -0.00740  1.31e-4 -0.0123  -1.35e-3 -3.16e-3  7.78e-3 -0.0283  -0.0259 
## # … with 10 more variables: Seat.comfort <dbl>, Inflight.entertainment <dbl>,
## #   On.board.service <dbl>, Leg.room.service <dbl>, Baggage.handling <dbl>,
## #   Checkin.service <dbl>, Inflight.service <dbl>, Cleanliness <dbl>,
## #   Departure.Delay.in.Minutes <dbl>, Arrival.Delay.in.Minutes <dbl>, and
## #   abbreviated variable names ¹​Flight.Distance, ²​Inflight.wifi.service,
## #   ³​Departure.Arrival.time.convenient, ⁴​Ease.of.Online.booking,
## #   ⁵​Gate.location, ⁶​Food.and.drink, ⁷​Online.boarding
# heatmap of correlation
corr_matrix %>%
  rearrange(method = "MDS", absolute = FALSE) %>%
  shave() %>%
  rplot(shape = 15, colours = c("#D3D3D3", "#3F7ED5")) +
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust = 1),
        axis.text = element_text(size = 8),
        axis.ticks = element_blank())

Before applying dimension deduction technique to the dataset, we can look at the correlation matrix to identify any strong / relatively strong positive or negative correlations between the numerical features.

If we set a cutoff point of 0.3 for the correlation coefficient, we can see that Age, Flight.Distance and Checkin.service does not have any correlation coefficients higher than 0.3, indicating that these 3 variables are highly independent from other variables and should be excluded from our analysis.

# set a cut-off point for correlation coefficients
cutoff <- function(x) all(abs(x) < 0.3, na.rm = TRUE)

# extract variables that has no coefficients
correlate(air[,c(4, 7:23)]) %>%
  focus_if(cutoff, mirror = FALSE)
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
## # A tibble: 15 × 4
##    term                                   Age Flight.Distance Checkin.service
##    <chr>                                <dbl>           <dbl>           <dbl>
##  1 Inflight.wifi.service              0.00912        0.00460         0.0465  
##  2 Departure.Arrival.time.convenient  0.0319        -0.0149          0.0829  
##  3 Ease.of.Online.booking             0.0137         0.0620          0.000458
##  4 Gate.location                      0.00268        0.00795        -0.0546  
##  5 Food.and.drink                     0.0246         0.0578          0.0773  
##  6 Online.boarding                    0.203          0.215           0.204   
##  7 Seat.comfort                       0.155          0.159           0.183   
##  8 Inflight.entertainment             0.0695         0.138           0.115   
##  9 On.board.service                   0.0552         0.118           0.248   
## 10 Leg.room.service                   0.0330         0.137           0.151   
## 11 Baggage.handling                  -0.0505         0.0712          0.240   
## 12 Inflight.service                  -0.0593         0.0661          0.240   
## 13 Cleanliness                        0.0489         0.106           0.165   
## 14 Departure.Delay.in.Minutes        -0.00545        0.00240        -0.0209  
## 15 Arrival.Delay.in.Minutes          -0.00740        0.000131       -0.0271

Subsets

Since we are going to investigate the underlying patterns of the dataset and identify the factors that are highly correlated to the satisfaction level of US airline passengers, only numerical and rank data will be used. We will also exclude the Age, Flight.Distance and Checkin.service variable as mentioned above. In a later section, we will consider a subset of all numerical/rank and categorical variables for logistic regression.

As such, several subsets will be created according to the purpose of each analysis.

# numerical and ranked variables
# with satisfaction variable
df <- air[,c(8:18, 20:24)]
cat("Subset df\nDimension:", dim(df), "\n\n")
## Subset df
## Dimension: 25893 16
colnames(df)
##  [1] "Inflight.wifi.service"             "Departure.Arrival.time.convenient"
##  [3] "Ease.of.Online.booking"            "Gate.location"                    
##  [5] "Food.and.drink"                    "Online.boarding"                  
##  [7] "Seat.comfort"                      "Inflight.entertainment"           
##  [9] "On.board.service"                  "Leg.room.service"                 
## [11] "Baggage.handling"                  "Inflight.service"                 
## [13] "Cleanliness"                       "Departure.Delay.in.Minutes"       
## [15] "Arrival.Delay.in.Minutes"          "satisfaction"
head(df)
##   Inflight.wifi.service Departure.Arrival.time.convenient
## 1                     5                                 4
## 2                     1                                 1
## 3                     2                                 0
## 4                     0                                 0
## 5                     2                                 3
## 6                     3                                 3
##   Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 1                      3             4              3               4
## 2                      3             1              5               4
## 3                      2             4              2               2
## 4                      0             2              3               4
## 5                      4             3              4               1
## 6                      3             3              5               5
##   Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 1            3                      5                5                5
## 2            5                      4                4                4
## 3            2                      2                4                1
## 4            4                      1                1                1
## 5            2                      2                2                2
## 6            3                      5                4                3
##   Baggage.handling Inflight.service Cleanliness Departure.Delay.in.Minutes
## 1                5                5           5                         50
## 2                4                4           5                          0
## 3                3                2           2                          0
## 4                1                1           4                          0
## 5                2                2           4                          0
## 6                1                2           5                          0
##   Arrival.Delay.in.Minutes            satisfaction
## 1                       44               satisfied
## 2                        0               satisfied
## 3                        0 neutral or dissatisfied
## 4                        6               satisfied
## 5                       20               satisfied
## 6                        0               satisfied
# numerical and ranked variables
# without satisfaction variable (for PCA & FA)
df1 <- df[-16]
cat("Subset df1\nDimension:", dim(df1), "\n\n")
## Subset df1
## Dimension: 25893 15
colnames(df1)
##  [1] "Inflight.wifi.service"             "Departure.Arrival.time.convenient"
##  [3] "Ease.of.Online.booking"            "Gate.location"                    
##  [5] "Food.and.drink"                    "Online.boarding"                  
##  [7] "Seat.comfort"                      "Inflight.entertainment"           
##  [9] "On.board.service"                  "Leg.room.service"                 
## [11] "Baggage.handling"                  "Inflight.service"                 
## [13] "Cleanliness"                       "Departure.Delay.in.Minutes"       
## [15] "Arrival.Delay.in.Minutes"
head(df1)
##   Inflight.wifi.service Departure.Arrival.time.convenient
## 1                     5                                 4
## 2                     1                                 1
## 3                     2                                 0
## 4                     0                                 0
## 5                     2                                 3
## 6                     3                                 3
##   Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 1                      3             4              3               4
## 2                      3             1              5               4
## 3                      2             4              2               2
## 4                      0             2              3               4
## 5                      4             3              4               1
## 6                      3             3              5               5
##   Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 1            3                      5                5                5
## 2            5                      4                4                4
## 3            2                      2                4                1
## 4            4                      1                1                1
## 5            2                      2                2                2
## 6            3                      5                4                3
##   Baggage.handling Inflight.service Cleanliness Departure.Delay.in.Minutes
## 1                5                5           5                         50
## 2                4                4           5                          0
## 3                3                2           2                          0
## 4                1                1           4                          0
## 5                2                2           4                          0
## 6                1                2           5                          0
##   Arrival.Delay.in.Minutes
## 1                       44
## 2                        0
## 3                        0
## 4                        6
## 5                       20
## 6                        0
# only ranked features
# without satisfaction variable (for CA)
df1_CA <- subset(df1, select = -c(4,7,11,14,15))
cat("Subset df_CA\nDimension:", dim(df1_CA), "\n\n")
## Subset df_CA
## Dimension: 25893 10
colnames(df1_CA)
##  [1] "Inflight.wifi.service"             "Departure.Arrival.time.convenient"
##  [3] "Ease.of.Online.booking"            "Food.and.drink"                   
##  [5] "Online.boarding"                   "Inflight.entertainment"           
##  [7] "On.board.service"                  "Leg.room.service"                 
##  [9] "Inflight.service"                  "Cleanliness"
head(df1_CA)
##   Inflight.wifi.service Departure.Arrival.time.convenient
## 1                     5                                 4
## 2                     1                                 1
## 3                     2                                 0
## 4                     0                                 0
## 5                     2                                 3
## 6                     3                                 3
##   Ease.of.Online.booking Food.and.drink Online.boarding Inflight.entertainment
## 1                      3              3               4                      5
## 2                      3              5               4                      4
## 3                      2              2               2                      2
## 4                      0              3               4                      1
## 5                      4              4               1                      2
## 6                      3              5               5                      5
##   On.board.service Leg.room.service Inflight.service Cleanliness
## 1                5                5                5           5
## 2                4                4                4           5
## 3                4                1                2           2
## 4                1                1                1           4
## 5                2                2                2           4
## 6                4                3                2           5
# all variables in the original dataset except id (for logistic regression model)
all_df <- air[-1]
cat("Subset all_df\nDimension:", dim(all_df), "\n\n")
## Subset all_df
## Dimension: 25893 23
colnames(all_df)
##  [1] "Gender"                            "Customer.Type"                    
##  [3] "Age"                               "Type.of.Travel"                   
##  [5] "Class"                             "Flight.Distance"                  
##  [7] "Inflight.wifi.service"             "Departure.Arrival.time.convenient"
##  [9] "Ease.of.Online.booking"            "Gate.location"                    
## [11] "Food.and.drink"                    "Online.boarding"                  
## [13] "Seat.comfort"                      "Inflight.entertainment"           
## [15] "On.board.service"                  "Leg.room.service"                 
## [17] "Baggage.handling"                  "Checkin.service"                  
## [19] "Inflight.service"                  "Cleanliness"                      
## [21] "Departure.Delay.in.Minutes"        "Arrival.Delay.in.Minutes"         
## [23] "satisfaction"
head(all_df)
##   Gender     Customer.Type Age  Type.of.Travel    Class Flight.Distance
## 1 Female    Loyal Customer  52 Business travel      Eco             160
## 2 Female    Loyal Customer  36 Business travel Business            2863
## 3   Male disloyal Customer  20 Business travel      Eco             192
## 4   Male    Loyal Customer  44 Business travel Business            3377
## 5 Female    Loyal Customer  49 Business travel      Eco            1182
## 6   Male    Loyal Customer  16 Business travel      Eco             311
##   Inflight.wifi.service Departure.Arrival.time.convenient
## 1                     5                                 4
## 2                     1                                 1
## 3                     2                                 0
## 4                     0                                 0
## 5                     2                                 3
## 6                     3                                 3
##   Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 1                      3             4              3               4
## 2                      3             1              5               4
## 3                      2             4              2               2
## 4                      0             2              3               4
## 5                      4             3              4               1
## 6                      3             3              5               5
##   Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 1            3                      5                5                5
## 2            5                      4                4                4
## 3            2                      2                4                1
## 4            4                      1                1                1
## 5            2                      2                2                2
## 6            3                      5                4                3
##   Baggage.handling Checkin.service Inflight.service Cleanliness
## 1                5               2                5           5
## 2                4               3                4           5
## 3                3               2                2           2
## 4                1               3                1           4
## 5                2               4                2           4
## 6                1               1                2           5
##   Departure.Delay.in.Minutes Arrival.Delay.in.Minutes            satisfaction
## 1                         50                       44               satisfied
## 2                          0                        0               satisfied
## 3                          0                        0 neutral or dissatisfied
## 4                          0                        6               satisfied
## 5                          0                       20               satisfied
## 6                          0                        0               satisfied

Principal Component Analysis

Following the literature review on the study of flooding in Metro Manila, Principle Component can be used to analyze which factor(s) make significant contributions to the model. In fact, PCA is commonly applied to interval data, where it can help to identify patterns and relationships among the variables, and to reduce the dimension of the data to a smaller set of meaningful principal components.

It’s important to consider the scaling of the data before performing the procedure, as PCA is sensitive to the scale of the input variables. As such, we will first standardize the data before applying PCA to ensure that each variable contributes equally to the analysis.

# perform pca on scaled data
airline.pca <- prcomp(df1, scale = TRUE) # only for numeric variables

# eigen vectors
airline.pca$rotation
##                                           PC1          PC2          PC3
## Inflight.wifi.service             -0.23925274  0.442446506 -0.072256918
## Departure.Arrival.time.convenient -0.09100587  0.423008024 -0.009790826
## Ease.of.Online.booking            -0.16793110  0.525062498 -0.079347656
## Gate.location                     -0.06816985  0.426679040 -0.068023649
## Food.and.drink                    -0.32013129 -0.186270227 -0.324389216
## Online.boarding                   -0.28060823  0.151252291 -0.174564835
## Seat.comfort                      -0.35303712 -0.173392624 -0.301887713
## Inflight.entertainment            -0.44276714 -0.183947275 -0.035481236
## On.board.service                  -0.27187035 -0.041004501  0.402184247
## Leg.room.service                  -0.22414466  0.009282316  0.314362300
## Baggage.handling                  -0.26107477 -0.028056947  0.444962063
## Inflight.service                  -0.26524868 -0.036398933  0.455456697
## Cleanliness                       -0.36588486 -0.199220406 -0.295681044
## Departure.Delay.in.Minutes         0.03810765  0.035453456 -0.031515902
## Arrival.Delay.in.Minutes           0.04159285  0.035480879 -0.035143410
##                                            PC4           PC5         PC6
## Inflight.wifi.service              0.014255707 -0.2583079942  0.01615871
## Departure.Arrival.time.convenient  0.018576656  0.4304499504 -0.19652154
## Ease.of.Online.booking             0.020163115 -0.1716503762  0.05025981
## Gate.location                      0.014384189  0.4957027522  0.19688093
## Food.and.drink                    -0.009095508  0.1922186236  0.06927214
## Online.boarding                    0.009929605 -0.5922951978 -0.26613221
## Seat.comfort                      -0.009707081  0.0262176970 -0.02419582
## Inflight.entertainment            -0.038343648  0.1388067685  0.04318400
## On.board.service                  -0.027666962 -0.0109842829 -0.24341472
## Leg.room.service                  -0.060367847 -0.1784212240  0.84056064
## Baggage.handling                  -0.052896100  0.0743482204 -0.18386908
## Inflight.service                  -0.004691003  0.0977359580 -0.19956432
## Cleanliness                       -0.024280614  0.1343504709  0.02902730
## Departure.Delay.in.Minutes        -0.703556951 -0.0017109787 -0.02802289
## Arrival.Delay.in.Minutes          -0.703051041 -0.0008505072 -0.02646164
##                                            PC7          PC8          PC9
## Inflight.wifi.service             -0.331814354  0.203126383 -0.172887764
## Departure.Arrival.time.convenient  0.616187157  0.435267124  0.007399416
## Ease.of.Online.booking            -0.192705604  0.074744900 -0.089768862
## Gate.location                     -0.175084370 -0.630002080  0.146378703
## Food.and.drink                    -0.238111965  0.315086823 -0.227086287
## Online.boarding                    0.258216460 -0.210124796  0.246312156
## Seat.comfort                       0.299475384 -0.273020343  0.234586435
## Inflight.entertainment            -0.165104081  0.010025613 -0.120499899
## On.board.service                   0.213008874 -0.345405221 -0.694872721
## Leg.room.service                   0.298555787  0.083790568  0.037393682
## Baggage.handling                  -0.182430786  0.119854856  0.475004088
## Inflight.service                  -0.183196318  0.080080703  0.214798661
## Cleanliness                        0.026848489  0.040122396  0.052278611
## Departure.Delay.in.Minutes        -0.002968948 -0.002197422 -0.002899163
## Arrival.Delay.in.Minutes          -0.004984626 -0.001288131 -0.005081611
##                                            PC10         PC11         PC12
## Inflight.wifi.service             -0.0391357702  0.384060682  0.177449542
## Departure.Arrival.time.convenient  0.0148089349 -0.012165847 -0.025975003
## Ease.of.Online.booking             0.0030552308 -0.018017931  0.049808244
## Gate.location                      0.0001411596 -0.177704492 -0.109646749
## Food.and.drink                    -0.0692629335 -0.657893067  0.153692996
## Online.boarding                    0.0562326993 -0.389860762 -0.264886945
## Seat.comfort                       0.0168528329  0.241766353  0.663287572
## Inflight.entertainment             0.0541198690  0.165538175 -0.057945723
## On.board.service                  -0.1614731331 -0.022311121 -0.004523106
## Leg.room.service                   0.0155351694 -0.080645772 -0.024227512
## Baggage.handling                  -0.6417776514 -0.037900979  0.022180215
## Inflight.service                   0.7400799534 -0.061252834  0.045749796
## Cleanliness                       -0.0068494931  0.370882894 -0.642616016
## Departure.Delay.in.Minutes         0.0260697207 -0.005101518  0.008508640
## Arrival.Delay.in.Minutes           0.0207106884 -0.002688373  0.004332571
##                                            PC13         PC14          PC15
## Inflight.wifi.service              0.5305200744  0.177102775 -0.0001820452
## Departure.Arrival.time.convenient  0.0924398214 -0.098220519 -0.0002768573
## Ease.of.Online.booking            -0.7755102724 -0.049020766  0.0011223177
## Gate.location                      0.1753357304  0.025544328 -0.0003144714
## Food.and.drink                     0.0657604889  0.206510285  0.0018226304
## Online.boarding                    0.1824869919 -0.126697223  0.0004190597
## Seat.comfort                      -0.1176710431  0.144882554  0.0041637389
## Inflight.entertainment             0.0463037080 -0.817617553 -0.0068686624
## On.board.service                  -0.0406325198  0.163530509  0.0023330957
## Leg.room.service                   0.0420493392  0.046602660  0.0014280155
## Baggage.handling                  -0.0391020525  0.048132253  0.0005464738
## Inflight.service                  -0.0239942538  0.176585085  0.0066054111
## Cleanliness                       -0.1367458514  0.381063375  0.0011106019
## Departure.Delay.in.Minutes        -0.0008395439  0.010958723 -0.7068211337
## Arrival.Delay.in.Minutes           0.0020251768 -0.001120516  0.7073059859
# variance of PCs / eigen values
s <- (airline.pca$sdev)^2

# a table of PC number and cumulative variance
eigenval_table <- data.frame(
  eigen_value = s
)
rownames(eigenval_table) <- paste0("PC", 1:length(s))

# print the table
print(eigenval_table)
##      eigen_value
## PC1   3.70318105
## PC2   2.40761587
## PC3   2.15207400
## PC4   1.96140530
## PC5   1.03920635
## PC6   0.71103945
## PC7   0.57143431
## PC8   0.50776351
## PC9   0.47330385
## PC10  0.36360545
## PC11  0.33100581
## PC12  0.28709740
## PC13  0.25918968
## PC14  0.19664017
## PC15  0.03543781

Optimal Number of Components

There are 15 principle components (PCs), each consisting of parts of all variables and capture different portions of the total variance. To determine the optimal number of PCs to retain, we use 3 approaches: Scree plot, Total Variance and Kaiser Criterion.

# Scree plot
fviz_eig(airline.pca) +
  geom_vline(xintercept = 5, linetype = "dashed", col = 'red')

# total variance

# cumulative variance
rs <- s/sum(s)

# a table of PC number and cumulative variance
var_table <- data.frame(
  cumulative_variance = cumsum(rs)
)
rownames(var_table) <- paste0("PC", 1:length(rs))

# print the table
print(var_table)
##      cumulative_variance
## PC1            0.2468787
## PC2            0.4073865
## PC3            0.5508581
## PC4            0.6816184
## PC5            0.7508988
## PC6            0.7983015
## PC7            0.8363971
## PC8            0.8702480
## PC9            0.9018016
## PC10           0.9260419
## PC11           0.9481090
## PC12           0.9672488
## PC13           0.9845281
## PC14           0.9976375
## PC15           1.0000000
# Kaiser criterion
cat("Mean variance =", mean(s), "\n\n")
## Mean variance = 1
# print the table
print(eigenval_table)
##      eigen_value
## PC1   3.70318105
## PC2   2.40761587
## PC3   2.15207400
## PC4   1.96140530
## PC5   1.03920635
## PC6   0.71103945
## PC7   0.57143431
## PC8   0.50776351
## PC9   0.47330385
## PC10  0.36360545
## PC11  0.33100581
## PC12  0.28709740
## PC13  0.25918968
## PC14  0.19664017
## PC15  0.03543781

Scree plot suggests retaining PC1 to PC5 since there are 5 points involved in the steep part of the curve. In total variance, 4 PCs capture around 67.8% of the total variance, 5 PCs capture around 74.8%. If we aim to capture 80 - 90% of the total variance, at least 7 PCs will be required. In Kaiser criterion, variance of PC1 to PC5 are greater than the average of variance, so 4 - 5 PCs are should be retained. After comparing the result, 5 PCs are selected in our PCA.

Then, we obtain the loadings and check if they have some meaningful interpretation.

Loadings

# obtain loadings for eigen vectors

cbind(airline.pca$sdev[1]*airline.pca$rotation[,1],
      airline.pca$sdev[2]*airline.pca$rotation[,2],
      airline.pca$sdev[3]*airline.pca$rotation[,3],
      airline.pca$sdev[4]*airline.pca$rotation[,4],
      airline.pca$sdev[5]*airline.pca$rotation[,5])
##                                          [,1]        [,2]        [,3]
## Inflight.wifi.service             -0.46040963  0.68652186 -0.10600053
## Departure.Arrival.time.convenient -0.17512852  0.65636015 -0.01436309
## Ease.of.Online.booking            -0.32316075  0.81471291 -0.11640261
## Gate.location                     -0.13118367  0.66205628 -0.09979035
## Food.and.drink                    -0.61604949 -0.28902609 -0.47587732
## Online.boarding                   -0.53999268  0.23469053 -0.25608572
## Seat.comfort                      -0.67937232 -0.26904456 -0.44286773
## Inflight.entertainment            -0.85204564 -0.28542168 -0.05205079
## On.board.service                  -0.52317782 -0.06362461  0.59000223
## Leg.room.service                  -0.43133617  0.01440290  0.46116789
## Baggage.handling                  -0.50240317 -0.04353454  0.65275707
## Inflight.service                  -0.51043530 -0.05647838  0.66815265
## Cleanliness                       -0.70409605 -0.30912022 -0.43376258
## Departure.Delay.in.Minutes         0.07333304  0.05501133 -0.04623367
## Arrival.Delay.in.Minutes           0.08003982  0.05505388 -0.05155520
##                                           [,4]          [,5]
## Inflight.wifi.service              0.019965144 -0.2633229687
## Departure.Arrival.time.convenient  0.026016640  0.4388070109
## Ease.of.Online.booking             0.028238479 -0.1749829183
## Gate.location                      0.020145082  0.5053266769
## Food.and.drink                    -0.012738276  0.1959504923
## Online.boarding                    0.013906430 -0.6037944367
## Seat.comfort                      -0.013594784  0.0267267060
## Inflight.entertainment            -0.053700348  0.1415016615
## On.board.service                  -0.038747630 -0.0111975395
## Leg.room.service                  -0.084545279 -0.1818852202
## Baggage.handling                  -0.074081083  0.0757916695
## Inflight.service                  -0.006569759  0.0996334731
## Cleanliness                       -0.034005044  0.1369588462
## Departure.Delay.in.Minutes        -0.985332778 -0.0017441968
## Arrival.Delay.in.Minutes          -0.984624251 -0.0008670196

For example, in PC1, a highly correlated group would be Inflight.entertainment, Cleanliness and Seat.comfort, while for PC2, a highly correlated group would be Gate.location, Ease.of.Online.booking, Inflight.wifi.service and Departure.Arrival.time.convenient. To further enhance the interpretation of the original variables, varimax rotation is considered.

Varimax Rotation

# scale df1
center_df1 <- scale(df1, center = TRUE, scale = TRUE)

# apply varimax rotation to scaled df1
fit <- principal(center_df1, nfactors = 5, rotate = "varimax")
fit$loadings
## 
## Loadings:
##                                   RC1    RC3    RC4    RC5    RC2   
## Inflight.wifi.service                     0.121         0.769  0.392
## Departure.Arrival.time.convenient                       0.131  0.797
## Ease.of.Online.booking                                  0.730  0.525
## Gate.location                                                  0.843
## Food.and.drink                     0.852                            
## Online.boarding                    0.305                0.811 -0.145
## Seat.comfort                       0.836                0.167       
## Inflight.entertainment             0.789  0.454                     
## On.board.service                          0.784                     
## Leg.room.service                          0.623         0.206       
## Baggage.handling                          0.827                     
## Inflight.service                          0.841                     
## Cleanliness                        0.889                            
## Departure.Delay.in.Minutes                       0.990              
## Arrival.Delay.in.Minutes                         0.990              
## 
##                  RC1   RC3   RC4   RC5   RC2
## SS loadings    2.957 2.631 1.969 1.892 1.815
## Proportion Var 0.197 0.175 0.131 0.126 0.121
## Cumulative Var 0.197 0.373 0.504 0.630 0.751

The first 5 components captured around 75.1% of variance in vairmax rotation. Considering a cut-off value of 0.6:

  • Cleanliness, Food.and.drink, Inflight.entertainment and Seat.comfort are loaded on component 1, categorized as In-flight experience.

  • Departure.Arrival.time.convenient and Gate.location are loaded on component 2, categorized as Convenience.

  • Baggage.handling, Inflight.service, Leg.room.service and On.board.service are loaded on component 3, categorized as Service.

  • Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes loaded on component 4, categorized as Punctuation.

  • Inflight.wifi.service, Ease.of.Online.booking and Online.boarding are loaded on component 5, categorized as Technology.

Visualization

From the PCA results, we can also produce a 2-dimensional scatter plot for pairs of components, with each point color-coded according to the response of the satisfaction it represents. The following plot uses PC1 and PC4 to evaluate whether components can be classified by the response variable satisfaction. However, the result is not as expected. This could be due to PC1 and PC4 only capturing 33% of the total variance.

# visualization
pca_data <- as.data.frame(center_df1 %*% fit$loadings)
pca_data$satisfaction <- df$satisfaction

ggplot(pca_data, aes(x = RC1, y = RC4, color = satisfaction)) +
  geom_point()

Try to produce a 2-dimensional scatter plot for pairs of components and compare with other catagorical variales like gender and class, the results are similar.

# visualization
pca_data2 <- as.data.frame(center_df1 %*% fit$loadings)
pca_data2$Class <- air$Class

ggplot(pca_data2, aes(x = RC1, y = RC4, color = Class)) +
  geom_point()

# visualization
pca_data2 <- as.data.frame(center_df1 %*% fit$loadings)
pca_data2$Gender <- air$Gender

ggplot(pca_data2, aes(x = RC1, y = RC4, color = Gender)) +
  geom_point()

Factor Analysis

Data Assessment

Apart from principal component analysis, we could also use factor analysis to identify latent variables in the data sets.

Before conducting factor analysis, it is essential to first assess whether our dataset is appropriate for such analysis. This can be done using Bartlett’s test and KMO test.

Bartlett’s test is a statistical test used to determine whether the correlation matrix of a set of variables is suitable for factor analysis. It tests the null hypothesis that the correlation matrix is an identity matrix, indicating that the variables are uncorrelated and unsuitable for factor analysis. If the null hypothesis is rejected, it indicates that there is significant correlation among the variables, suggesting that factor analysis may be appropriate.

In practice, Bartlett’s test is often used in conjunction with the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy to assess the suitability of data for factor analysis. The KMO measure assesses the degree of common variance among the variables, while Bartlett’s test assesses the overall significance of the correlations. If both tests indicate that the data is suitable for factor analysis, then we can proceed with conducting the analysis.

Our Bartlett’s test produced a p-value of close to 0, which is less than significant level of 0.05, we can reject the null hypothesis and conclude that a factor analysis may be useful for our dataset.

At the same time, KMO test gave an overall MSA of 0.73, meaning there is a significant number of factors in our dataset.

# Bartlett's test
correlations = cor(df1)
cortest.bartlett(correlations, n = nrow(df1))
## $chisq
## [1] 215937.2
## 
## $p.value
## [1] 0
## 
## $df
## [1] 105
# KMO test
KMO(correlations)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = correlations)
## Overall MSA =  0.74
## MSA for each item = 
##             Inflight.wifi.service Departure.Arrival.time.convenient 
##                              0.74                              0.76 
##            Ease.of.Online.booking                     Gate.location 
##                              0.69                              0.70 
##                    Food.and.drink                   Online.boarding 
##                              0.84                              0.72 
##                      Seat.comfort            Inflight.entertainment 
##                              0.82                              0.79 
##                  On.board.service                  Leg.room.service 
##                              0.83                              0.89 
##                  Baggage.handling                  Inflight.service 
##                              0.81                              0.78 
##                       Cleanliness        Departure.Delay.in.Minutes 
##                              0.83                              0.50 
##          Arrival.Delay.in.Minutes 
##                              0.50

Optimal Number of Factors

Similar to principle component analysis, we can determine the number of factors using parallel analysis, Kaiser criterion and Scree plot.

Maximum likelihood estimation (MLE) is used for conducting factor analysis. MLE for factor analysis provides a flexible and powerful method for modeling the underlying factor structure of a set of variables, and is well-suited to a wide range of data types and assumptions.

# parallel analysis
nofactors = fa.parallel(df1, fm="ml", fa="fa")

## Parallel analysis suggests that the number of factors =  5  and the number of components =  NA
# Kaiser criterion
# extract the number of eigenvalues
cat("Old criterion (> 1.0) - Number of factors:", sum(nofactors$fa.values > 1.0),"\n")
## Old criterion (> 1.0) - Number of factors: 4
cat("New criterion (> 0.7) - Number of factors:", sum(nofactors$fa.values > .7))
## New criterion (> 0.7) - Number of factors: 4
# Scree plot and parallel analysis
ev <- eigen(cor(df1)) # get eigenvalues
ap <- parallel(subject=nrow(df1),var=ncol(df1),
               rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

From the analysis above, the optimal number of factors is suggested to be between 4 and 5. Since we want to compare the result of PCA and FA, we will take the same number of factors as the number of principla components, which is 5.

Factor Analysis without Rotation

# no rotation
df.fa <- fa(df1, nfactors = 5, rotate = "none", fm="ml")

# summary of the FA results
df.fa # some empty loading because the value is too small.
## Factor Analysis using method =  ml
## Call: fa(r = df1, nfactors = 5, rotate = "none", fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                                    ML2   ML3   ML1   ML4   ML5   h2    u2 com
## Inflight.wifi.service             0.55 -0.28 -0.01  0.45  0.22 0.64 0.363 2.8
## Departure.Arrival.time.convenient 0.15 -0.21  0.00  0.41  0.23 0.29 0.713 2.4
## Ease.of.Online.booking            0.48 -0.45  0.00  0.53  0.31 0.82 0.184 3.6
## Gate.location                     0.09 -0.19  0.01  0.44  0.35 0.36 0.641 2.4
## Food.and.drink                    0.43  0.52 -0.03 -0.12  0.38 0.62 0.384 2.9
## Online.boarding                   0.91 -0.25 -0.03 -0.18 -0.13 0.94 0.060 1.3
## Seat.comfort                      0.57  0.46 -0.04 -0.21  0.23 0.63 0.369 2.6
## Inflight.entertainment            0.55  0.72 -0.03  0.15  0.07 0.84 0.159 2.0
## On.board.service                  0.26  0.32 -0.03  0.37 -0.44 0.50 0.497 3.5
## Leg.room.service                  0.22  0.20  0.01  0.32 -0.26 0.26 0.744 3.5
## Baggage.handling                  0.21  0.34 -0.01  0.45 -0.47 0.58 0.418 3.2
## Inflight.service                  0.20  0.36 -0.06  0.48 -0.49 0.65 0.353 3.2
## Cleanliness                       0.53  0.58 -0.02 -0.15  0.31 0.73 0.270 2.7
## Departure.Delay.in.Minutes        0.01  0.00  0.97  0.00 -0.01 0.93 0.065 1.0
## Arrival.Delay.in.Minutes          0.00  0.00  1.00  0.00  0.00 1.00 0.005 1.0
## 
##                        ML2  ML3  ML1  ML4  ML5
## SS loadings           2.68 2.14 1.94 1.66 1.35
## Proportion Var        0.18 0.14 0.13 0.11 0.09
## Cumulative Var        0.18 0.32 0.45 0.56 0.65
## Proportion Explained  0.27 0.22 0.20 0.17 0.14
## Cumulative Proportion 0.27 0.49 0.69 0.86 1.00
## 
## Mean item complexity =  2.6
## Test of the hypothesis that 5 factors are sufficient.
## 
## df null model =  105  with the objective function =  8.34 with Chi Square =  215937.2
## df of  the model are 40  and the objective function was  0.14 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic n.obs is  25893 with the empirical chi square  2190.43  with prob <  0 
## The total n.obs was  25893  with Likelihood Chi Square =  3581.96  with prob <  0 
## 
## Tucker Lewis Index of factoring reliability =  0.957
## RMSEA index =  0.058  and the 90 % confidence intervals are  0.057 0.06
## BIC =  3175.49
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy             
##                                                    ML2  ML3  ML1  ML4  ML5
## Correlation of (regression) scores with factors   0.98 0.95 1.00 0.92 0.89
## Multiple R square of scores with factors          0.95 0.90 1.00 0.84 0.79
## Minimum correlation of possible factor scores     0.91 0.80 0.99 0.68 0.58
# diagram
fa.diagram(df.fa)

Without any rotation, some groups can still be easily categorized, i.e. ML1 for punctuation and ML4 for convenience, while other groups are not as apparent. Let’s consider rotation such that it redistributes the variance across factors to enhance the interpretation.

Factor Analysis with Rotation

# varimax rotation
df.fa.varimax <- fa(df1, nfactors = 5, rotate = "varimax", fm="ml")

# summary of the FA results
df.fa.varimax # some empty loading because the value is too small.
## Factor Analysis using method =  ml
## Call: fa(r = df1, nfactors = 5, rotate = "varimax", fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                                     ML3   ML5   ML1   ML4   ML2   h2    u2 com
## Inflight.wifi.service              0.12  0.13 -0.01  0.71  0.31 0.64 0.363 1.5
## Departure.Arrival.time.convenient -0.04  0.04  0.00  0.53 -0.02 0.29 0.713 1.0
## Ease.of.Online.booking            -0.01  0.04  0.00  0.86  0.27 0.82 0.184 1.2
## Gate.location                      0.00 -0.03  0.01  0.59 -0.11 0.36 0.641 1.1
## Food.and.drink                     0.78  0.00 -0.01  0.03  0.01 0.62 0.384 1.0
## Online.boarding                    0.28  0.09 -0.01  0.19  0.90 0.94 0.060 1.3
## Seat.comfort                       0.76  0.05 -0.02 -0.03  0.22 0.63 0.369 1.2
## Inflight.entertainment             0.79  0.47 -0.01  0.04  0.01 0.84 0.159 1.6
## On.board.service                   0.09  0.70 -0.02  0.00  0.07 0.50 0.497 1.1
## Leg.room.service                   0.08  0.49  0.02  0.09  0.05 0.26 0.744 1.1
## Baggage.handling                   0.05  0.76  0.01  0.02  0.00 0.58 0.418 1.0
## Inflight.service                   0.05  0.80 -0.05  0.02 -0.02 0.65 0.353 1.0
## Cleanliness                        0.85  0.07  0.00 -0.01  0.09 0.73 0.270 1.0
## Departure.Delay.in.Minutes        -0.02 -0.01  0.97  0.00 -0.01 0.93 0.065 1.0
## Arrival.Delay.in.Minutes          -0.02 -0.02  1.00  0.00 -0.01 1.00 0.005 1.0
## 
##                        ML3  ML5  ML1  ML4  ML2
## SS loadings           2.64 2.21 1.93 1.93 1.06
## Proportion Var        0.18 0.15 0.13 0.13 0.07
## Cumulative Var        0.18 0.32 0.45 0.58 0.65
## Proportion Explained  0.27 0.23 0.20 0.20 0.11
## Cumulative Proportion 0.27 0.50 0.69 0.89 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 5 factors are sufficient.
## 
## df null model =  105  with the objective function =  8.34 with Chi Square =  215937.2
## df of  the model are 40  and the objective function was  0.14 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic n.obs is  25893 with the empirical chi square  2190.43  with prob <  0 
## The total n.obs was  25893  with Likelihood Chi Square =  3581.96  with prob <  0 
## 
## Tucker Lewis Index of factoring reliability =  0.957
## RMSEA index =  0.058  and the 90 % confidence intervals are  0.057 0.06
## BIC =  3175.49
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy             
##                                                    ML3  ML5  ML1  ML4  ML2
## Correlation of (regression) scores with factors   0.94 0.91 1.00 0.92 0.95
## Multiple R square of scores with factors          0.89 0.84 1.00 0.85 0.91
## Minimum correlation of possible factor scores     0.78 0.67 0.99 0.70 0.82
# diagram
fa.diagram(df.fa.varimax)

# oblimin rotation
df.fa.oblimin <- fa(df1, nfactors = 5, rotate = "oblimin", fm="ml")
## Loading required namespace: GPArotation
# summary of the FA results
df.fa.oblimin # some empty loading because the value is too small.
## Factor Analysis using method =  ml
## Call: fa(r = df1, nfactors = 5, rotate = "oblimin", fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                                     ML3   ML5   ML4   ML1   ML2   h2    u2 com
## Inflight.wifi.service              0.06  0.07  0.70  0.00  0.17 0.64 0.363 1.1
## Departure.Arrival.time.convenient -0.01  0.02  0.56  0.00 -0.14 0.29 0.713 1.1
## Ease.of.Online.booking            -0.05 -0.02  0.87  0.00  0.10 0.82 0.184 1.0
## Gate.location                      0.06 -0.06  0.64  0.00 -0.26 0.36 0.641 1.4
## Food.and.drink                     0.82 -0.13  0.04 -0.01 -0.06 0.62 0.384 1.1
## Online.boarding                    0.04  0.01  0.06  0.00  0.93 0.94 0.060 1.0
## Seat.comfort                       0.73 -0.07 -0.06 -0.01  0.20 0.63 0.369 1.2
## Inflight.entertainment             0.78  0.35  0.03  0.01 -0.05 0.84 0.159 1.4
## On.board.service                   0.01  0.70 -0.03 -0.01  0.07 0.50 0.497 1.0
## Leg.room.service                   0.03  0.48  0.07  0.03  0.03 0.26 0.744 1.1
## Baggage.handling                  -0.02  0.77  0.00  0.02  0.00 0.58 0.418 1.0
## Inflight.service                  -0.01  0.81  0.00 -0.03 -0.02 0.65 0.353 1.0
## Cleanliness                        0.85 -0.06 -0.02  0.00  0.04 0.73 0.270 1.0
## Departure.Delay.in.Minutes         0.00  0.00  0.00  0.97  0.00 0.93 0.065 1.0
## Arrival.Delay.in.Minutes           0.00  0.00  0.00  1.00  0.00 1.00 0.005 1.0
## 
##                        ML3  ML5  ML4  ML1  ML2
## SS loadings           2.61 2.14 2.00 1.93 1.10
## Proportion Var        0.17 0.14 0.13 0.13 0.07
## Cumulative Var        0.17 0.32 0.45 0.58 0.65
## Proportion Explained  0.27 0.22 0.20 0.20 0.11
## Cumulative Proportion 0.27 0.49 0.69 0.89 1.00
## 
##  With factor correlations of 
##       ML3   ML5  ML4   ML1   ML2
## ML3  1.00  0.23 0.05 -0.03  0.32
## ML5  0.23  1.00 0.09 -0.03  0.10
## ML4  0.05  0.09 1.00  0.00  0.34
## ML1 -0.03 -0.03 0.00  1.00 -0.02
## ML2  0.32  0.10 0.34 -0.02  1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 5 factors are sufficient.
## 
## df null model =  105  with the objective function =  8.34 with Chi Square =  215937.2
## df of  the model are 40  and the objective function was  0.14 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic n.obs is  25893 with the empirical chi square  2190.43  with prob <  0 
## The total n.obs was  25893  with Likelihood Chi Square =  3581.96  with prob <  0 
## 
## Tucker Lewis Index of factoring reliability =  0.957
## RMSEA index =  0.058  and the 90 % confidence intervals are  0.057 0.06
## BIC =  3175.49
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy             
##                                                    ML3  ML5  ML4  ML1  ML2
## Correlation of (regression) scores with factors   0.95 0.92 0.93 1.00 0.97
## Multiple R square of scores with factors          0.91 0.84 0.87 1.00 0.94
## Minimum correlation of possible factor scores     0.82 0.69 0.74 0.99 0.88
# diagram
fa.diagram(df.fa.oblimin)

Varimax rotation produced clearer categorization for the variables. Specifically:

  • In-flight experience: Cleanliness, Seat.comfort, Food.and.drink, Inflight.entertainment

  • Convenience: Ease.of.Online.booking, Inflight.wifi.service, Gate.location, Departure.Arrival.time.convenient

  • Service: Inflight.service, Baggage.handling, On.board.service, Leg.room.service

  • Punctuation: Departure.Delay.in.Minutes, Arrival.Delay.in.Minutes

  • Other: Online.boarding

It’s no surprise that the categorization is similar to PCA and some grouppings are exactly the same. In terms of grouping, varimax rotation gave a more distinguish result in this dataset.

Linear Discriminant Analysis

To maximize the separation between groups, we will conduct Linear Discriminant Analysis (LDA).

# split the dataset into train set and test set (80/20)
set.seed(100)
sample <- sample.int(n = nrow(df), size = floor(.8*nrow(df)), replace = F) # 80% of dataset
train <- df[sample, ]
test <- df[-sample, ]
# LDA on train set using all the variables
lda.sa <- lda(satisfaction~. , data=train)
lda.sa
## Call:
## lda(satisfaction ~ ., data = train)
## 
## Prior probabilities of groups:
## neutral or dissatisfied               satisfied 
##               0.5597663               0.4402337 
## 
## Group means:
##                         Inflight.wifi.service Departure.Arrival.time.convenient
## neutral or dissatisfied              2.393618                          3.133161
## satisfied                            3.152319                          2.942099
##                         Ease.of.Online.booking Gate.location Food.and.drink
## neutral or dissatisfied               2.560414      3.003967       2.965502
## satisfied                             3.019191      2.962057       3.553679
##                         Online.boarding Seat.comfort Inflight.entertainment
## neutral or dissatisfied        2.669340     3.049418               2.889263
## satisfied                      4.021713     3.957780               3.957451
##                         On.board.service Leg.room.service Baggage.handling
## neutral or dissatisfied         3.018284         2.988098         3.371712
## satisfied                       3.852615         3.808751         3.972585
##                         Inflight.service Cleanliness Departure.Delay.in.Minutes
## neutral or dissatisfied         3.391807    2.920828                   15.94756
## satisfied                       3.978397    3.758855                   12.19048
##                         Arrival.Delay.in.Minutes
## neutral or dissatisfied                 16.78456
## satisfied                               12.31407
## 
## Coefficients of linear discriminants:
##                                            LD1
## Inflight.wifi.service              0.167307082
## Departure.Arrival.time.convenient -0.174008029
## Ease.of.Online.booking            -0.085913703
## Gate.location                      0.055907376
## Food.and.drink                    -0.037248138
## Online.boarding                    0.568857329
## Seat.comfort                       0.092439547
## Inflight.entertainment             0.128792389
## On.board.service                   0.207621464
## Leg.room.service                   0.230826043
## Baggage.handling                   0.075863299
## Inflight.service                   0.026366805
## Cleanliness                        0.096395010
## Departure.Delay.in.Minutes         0.003038314
## Arrival.Delay.in.Minutes          -0.005180921
# prediction on test set
sa.pred <- predict(lda.sa, newdata = test)
# accuracy and classification report
summary(sa.pred$class)
## neutral or dissatisfied               satisfied 
##                    2913                    2266
xtab <- table(sa.pred$class, test$satisfaction)

# confusion matrix
caret::confusionMatrix(xtab, positive = "satisfied")
## Registered S3 methods overwritten by 'proxy':
##   method               from    
##   print.registry_field registry
##   print.registry_entry registry
## Confusion Matrix and Statistics
## 
##                          
##                           neutral or dissatisfied satisfied
##   neutral or dissatisfied                    2440       473
##   satisfied                                   493      1773
##                                          
##                Accuracy : 0.8135         
##                  95% CI : (0.8026, 0.824)
##     No Information Rate : 0.5663         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.6207         
##                                          
##  Mcnemar's Test P-Value : 0.541          
##                                          
##             Sensitivity : 0.7894         
##             Specificity : 0.8319         
##          Pos Pred Value : 0.7824         
##          Neg Pred Value : 0.8376         
##              Prevalence : 0.4337         
##          Detection Rate : 0.3423         
##    Detection Prevalence : 0.4375         
##       Balanced Accuracy : 0.8107         
##                                          
##        'Positive' Class : satisfied      
## 
# store probabilities assigned to cases
pb <- sa.pred$posterior

# create data frame from probabilities
pb <- as.data.frame(pb)

# combine probabilities and cases in data frame
pred.LDA <- data.frame(test$satisfaction,pb$satisfied)

# change column names
colnames(pred.LDA) <- c("target","score")

# transform the target to 1 and 0
labels <- as.factor(ifelse(pred.LDA$target=="satisfied", 1, 0))
predictions <- pred.LDA$score
# ROC curve
plot(roc(predictions, labels), min=0, max=1, type="l", main="LDA - ROC Chart")
text(0.5, 0.8, paste0("LDA AUC = ", round(auc(roc(predictions, labels)), 5)), col = "#3F7ED5", cex = 1.0)

The AUC value of 0.87 (> 0.8) and an accuracy of 81.35% indicate that LDA model performed relatively well in predicting the satisfaction level.

# Add predictions as a new column to the dataset for further analysis
df_class <- df
lda.full <- lda(satisfaction~. , data=df)
pred.full <- predict(lda.full)
df_class$lda_class <- pred.full$class
head(df_class)
##   Inflight.wifi.service Departure.Arrival.time.convenient
## 1                     5                                 4
## 2                     1                                 1
## 3                     2                                 0
## 4                     0                                 0
## 5                     2                                 3
## 6                     3                                 3
##   Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 1                      3             4              3               4
## 2                      3             1              5               4
## 3                      2             4              2               2
## 4                      0             2              3               4
## 5                      4             3              4               1
## 6                      3             3              5               5
##   Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 1            3                      5                5                5
## 2            5                      4                4                4
## 3            2                      2                4                1
## 4            4                      1                1                1
## 5            2                      2                2                2
## 6            3                      5                4                3
##   Baggage.handling Inflight.service Cleanliness Departure.Delay.in.Minutes
## 1                5                5           5                         50
## 2                4                4           5                          0
## 3                3                2           2                          0
## 4                1                1           4                          0
## 5                2                2           4                          0
## 6                1                2           5                          0
##   Arrival.Delay.in.Minutes            satisfaction               lda_class
## 1                       44               satisfied               satisfied
## 2                        0               satisfied               satisfied
## 3                        0 neutral or dissatisfied neutral or dissatisfied
## 4                        6               satisfied neutral or dissatisfied
## 5                       20               satisfied neutral or dissatisfied
## 6                        0               satisfied               satisfied

Stepwise Regression

Before we proceed to perform forward selection, backward elimination and step-wise regression on the same train and test data in LDA session, we first need to convert our response variable, satisfaction into numerical value.

  • 0 : neutral or dissatisfied
  • 1 : satisfied

We will have 2 new dataframes (train_re & test_re) with all 15 predictors and 1 response variable numerically encoded.

train_re <- train
test_re <- test

# convert the response variable into numerical data
train_re$satisfaction <- ifelse(train_re$satisfaction=="satisfied", 1, 0)
test_re$satisfaction <- ifelse(test_re$satisfaction=="satisfied", 1, 0)

head(train_re)
##       Inflight.wifi.service Departure.Arrival.time.convenient
## 20170                     2                                 1
## 16887                     1                                 1
## 3430                      4                                 0
## 3696                      2                                 2
## 20474                     3                                 5
## 24270                     1                                 1
##       Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 20170                      2             3              5               2
## 16887                      2             1              5               5
## 3430                       4             1              3               4
## 3696                       2             4              4               2
## 20474                      3             4              3               3
## 24270                      1             1              3               4
##       Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 20170            5                      5                3                5
## 16887            5                      5                5                2
## 3430             3                      3                4                1
## 3696             4                      4                4                5
## 20474            4                      2                2                3
## 24270            5                      4                4                4
##       Baggage.handling Inflight.service Cleanliness Departure.Delay.in.Minutes
## 20170                4                3           5                         22
## 16887                5                4           5                          0
## 3430                 2                1           3                         21
## 3696                 4                3           4                          0
## 20474                2                2           3                          2
## 24270                4                4           5                          0
##       Arrival.Delay.in.Minutes satisfaction
## 20170                        7            0
## 16887                       47            1
## 3430                        27            1
## 3696                        12            0
## 20474                        4            0
## 24270                        0            1
head(test_re)
##    Inflight.wifi.service Departure.Arrival.time.convenient
## 5                      2                                 3
## 6                      3                                 3
## 9                      5                                 2
## 12                     2                                 5
## 13                     5                                 5
## 17                     2                                 5
##    Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 5                       4             3              4               1
## 6                       3             3              5               5
## 9                       2             2              5               5
## 12                      5             5              1               3
## 13                      5             5              4               5
## 17                      5             5              2               2
##    Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 5             2                      2                2                2
## 6             3                      5                4                3
## 9             5                      5                2                2
## 12            4                      2                2                2
## 13            5                      5                5                5
## 17            2                      2                4                3
##    Baggage.handling Inflight.service Cleanliness Departure.Delay.in.Minutes
## 5                 2                2           4                          0
## 6                 1                2           5                          0
## 9                 5                3           5                          1
## 12                2                2           4                         18
## 13                5                5           3                          0
## 17                3                3           2                          2
##    Arrival.Delay.in.Minutes satisfaction
## 5                        20            1
## 6                         0            1
## 9                         0            1
## 12                        7            0
## 13                        0            1
## 17                        0            0

We will also need to define an intercept-only model an a model with all predictors included, using the train_re data.

# define intercept-only model
intercept_only <- glm(satisfaction ~ 1, data=train_re, family = "binomial"(link="logit"))

# define model with all predictors
all <- glm(satisfaction ~ ., data=train_re, family = "binomial"(link="logit"))

Forward Selection

# perform forward stepwise regression
forward <- step(intercept_only, direction='forward', scope=formula(all), trace=0)

# view results of forward stepwise regression
summary(forward)
## 
## Call:
## glm(formula = satisfaction ~ Online.boarding + Inflight.entertainment + 
##     Leg.room.service + On.board.service + Departure.Arrival.time.convenient + 
##     Inflight.wifi.service + Seat.comfort + Arrival.Delay.in.Minutes + 
##     Ease.of.Online.booking + Cleanliness + Baggage.handling + 
##     Gate.location + Departure.Delay.in.Minutes + Inflight.service + 
##     Food.and.drink, family = binomial(link = "logit"), data = train_re)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6668  -0.7052  -0.2557   0.6598   3.5119  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -7.124802   0.123413 -57.731  < 2e-16 ***
## Online.boarding                    0.839480   0.019052  44.062  < 2e-16 ***
## Inflight.entertainment             0.159662   0.025462   6.271 3.59e-10 ***
## Leg.room.service                   0.356670   0.015826  22.537  < 2e-16 ***
## On.board.service                   0.322724   0.018974  17.009  < 2e-16 ***
## Departure.Arrival.time.convenient -0.277757   0.014535 -19.109  < 2e-16 ***
## Inflight.wifi.service              0.324612   0.022605  14.360  < 2e-16 ***
## Seat.comfort                       0.137125   0.020757   6.606 3.95e-11 ***
## Arrival.Delay.in.Minutes          -0.007914   0.001887  -4.194 2.74e-05 ***
## Ease.of.Online.booking            -0.182070   0.022882  -7.957 1.76e-15 ***
## Cleanliness                        0.155328   0.022550   6.888 5.66e-12 ***
## Baggage.handling                   0.104356   0.021108   4.944 7.66e-07 ***
## Gate.location                      0.080271   0.017263   4.650 3.32e-06 ***
## Departure.Delay.in.Minutes         0.004065   0.001900   2.139   0.0324 *  
## Inflight.service                   0.038413   0.022109   1.737   0.0823 .  
## Food.and.drink                    -0.030381   0.020221  -1.502   0.1330    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28419  on 20713  degrees of freedom
## Residual deviance: 18621  on 20698  degrees of freedom
## AIC: 18653
## 
## Number of Fisher Scoring iterations: 5
# anova
forward$anova
##                                   Step Df    Deviance Resid. Df Resid. Dev
## 1                                      NA          NA     20713   28419.03
## 2                    + Online.boarding -1 5817.785334     20712   22601.24
## 3             + Inflight.entertainment -1 1977.286991     20711   20623.96
## 4                   + Leg.room.service -1  919.615584     20710   19704.34
## 5                   + On.board.service -1  330.017828     20709   19374.32
## 6  + Departure.Arrival.time.convenient -1  313.960049     20708   19060.36
## 7              + Inflight.wifi.service -1  132.350977     20707   18928.01
## 8                       + Seat.comfort -1   88.837003     20706   18839.18
## 9           + Arrival.Delay.in.Minutes -1   56.131794     20705   18783.04
## 10            + Ease.of.Online.booking -1   54.482287     20704   18728.56
## 11                       + Cleanliness -1   38.182605     20703   18690.38
## 12                  + Baggage.handling -1   37.482899     20702   18652.90
## 13                     + Gate.location -1   22.024474     20701   18630.87
## 14        + Departure.Delay.in.Minutes -1    4.706415     20700   18626.17
## 15                  + Inflight.service -1    3.338926     20699   18622.83
## 16                    + Food.and.drink -1    2.262173     20698   18620.56
##         AIC
## 1  28421.03
## 2  22605.24
## 3  20629.96
## 4  19712.34
## 5  19384.32
## 6  19072.36
## 7  18942.01
## 8  18855.18
## 9  18801.04
## 10 18748.56
## 11 18712.38
## 12 18676.90
## 13 18656.87
## 14 18654.17
## 15 18652.83
## 16 18652.56

From the result, we can drop the following variables since keeping them do not reduce the AIC too much:

  • Gate.location
  • Departure.Delay.in.Minutes
  • Inflight.service
  • Food.and.drink

Backward Elimination

# perform backward stepwise regression
backward <- step(all, direction='backward', scope=formula(all), trace=0)

# view results of backward stepwise regression
summary(backward)
## 
## Call:
## glm(formula = satisfaction ~ Inflight.wifi.service + Departure.Arrival.time.convenient + 
##     Ease.of.Online.booking + Gate.location + Food.and.drink + 
##     Online.boarding + Seat.comfort + Inflight.entertainment + 
##     On.board.service + Leg.room.service + Baggage.handling + 
##     Inflight.service + Cleanliness + Departure.Delay.in.Minutes + 
##     Arrival.Delay.in.Minutes, family = binomial(link = "logit"), 
##     data = train_re)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6668  -0.7052  -0.2557   0.6598   3.5119  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -7.124802   0.123413 -57.731  < 2e-16 ***
## Inflight.wifi.service              0.324612   0.022605  14.360  < 2e-16 ***
## Departure.Arrival.time.convenient -0.277757   0.014535 -19.109  < 2e-16 ***
## Ease.of.Online.booking            -0.182070   0.022882  -7.957 1.76e-15 ***
## Gate.location                      0.080271   0.017263   4.650 3.32e-06 ***
## Food.and.drink                    -0.030381   0.020221  -1.502   0.1330    
## Online.boarding                    0.839480   0.019052  44.062  < 2e-16 ***
## Seat.comfort                       0.137125   0.020757   6.606 3.95e-11 ***
## Inflight.entertainment             0.159662   0.025462   6.271 3.59e-10 ***
## On.board.service                   0.322724   0.018974  17.009  < 2e-16 ***
## Leg.room.service                   0.356670   0.015826  22.537  < 2e-16 ***
## Baggage.handling                   0.104356   0.021108   4.944 7.66e-07 ***
## Inflight.service                   0.038413   0.022109   1.737   0.0823 .  
## Cleanliness                        0.155328   0.022550   6.888 5.66e-12 ***
## Departure.Delay.in.Minutes         0.004065   0.001900   2.139   0.0324 *  
## Arrival.Delay.in.Minutes          -0.007914   0.001887  -4.194 2.74e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28419  on 20713  degrees of freedom
## Residual deviance: 18621  on 20698  degrees of freedom
## AIC: 18653
## 
## Number of Fisher Scoring iterations: 5
# anova
backward$anova
##   Step Df Deviance Resid. Df Resid. Dev      AIC
## 1      NA       NA     20698   18620.56 18652.56

From the result, we do not drop any variable in backward elimination process.

Stepwise Regression

# perform  stepwise regression
both <- step(intercept_only, direction='both', scope=formula(all), trace=0)

# view results of stepwise regression
summary(both)
## 
## Call:
## glm(formula = satisfaction ~ Online.boarding + Inflight.entertainment + 
##     Leg.room.service + On.board.service + Departure.Arrival.time.convenient + 
##     Inflight.wifi.service + Seat.comfort + Arrival.Delay.in.Minutes + 
##     Ease.of.Online.booking + Cleanliness + Baggage.handling + 
##     Gate.location + Departure.Delay.in.Minutes + Inflight.service + 
##     Food.and.drink, family = binomial(link = "logit"), data = train_re)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6668  -0.7052  -0.2557   0.6598   3.5119  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -7.124802   0.123413 -57.731  < 2e-16 ***
## Online.boarding                    0.839480   0.019052  44.062  < 2e-16 ***
## Inflight.entertainment             0.159662   0.025462   6.271 3.59e-10 ***
## Leg.room.service                   0.356670   0.015826  22.537  < 2e-16 ***
## On.board.service                   0.322724   0.018974  17.009  < 2e-16 ***
## Departure.Arrival.time.convenient -0.277757   0.014535 -19.109  < 2e-16 ***
## Inflight.wifi.service              0.324612   0.022605  14.360  < 2e-16 ***
## Seat.comfort                       0.137125   0.020757   6.606 3.95e-11 ***
## Arrival.Delay.in.Minutes          -0.007914   0.001887  -4.194 2.74e-05 ***
## Ease.of.Online.booking            -0.182070   0.022882  -7.957 1.76e-15 ***
## Cleanliness                        0.155328   0.022550   6.888 5.66e-12 ***
## Baggage.handling                   0.104356   0.021108   4.944 7.66e-07 ***
## Gate.location                      0.080271   0.017263   4.650 3.32e-06 ***
## Departure.Delay.in.Minutes         0.004065   0.001900   2.139   0.0324 *  
## Inflight.service                   0.038413   0.022109   1.737   0.0823 .  
## Food.and.drink                    -0.030381   0.020221  -1.502   0.1330    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28419  on 20713  degrees of freedom
## Residual deviance: 18621  on 20698  degrees of freedom
## AIC: 18653
## 
## Number of Fisher Scoring iterations: 5
# anova
both$anova
##                                   Step Df    Deviance Resid. Df Resid. Dev
## 1                                      NA          NA     20713   28419.03
## 2                    + Online.boarding -1 5817.785334     20712   22601.24
## 3             + Inflight.entertainment -1 1977.286991     20711   20623.96
## 4                   + Leg.room.service -1  919.615584     20710   19704.34
## 5                   + On.board.service -1  330.017828     20709   19374.32
## 6  + Departure.Arrival.time.convenient -1  313.960049     20708   19060.36
## 7              + Inflight.wifi.service -1  132.350977     20707   18928.01
## 8                       + Seat.comfort -1   88.837003     20706   18839.18
## 9           + Arrival.Delay.in.Minutes -1   56.131794     20705   18783.04
## 10            + Ease.of.Online.booking -1   54.482287     20704   18728.56
## 11                       + Cleanliness -1   38.182605     20703   18690.38
## 12                  + Baggage.handling -1   37.482899     20702   18652.90
## 13                     + Gate.location -1   22.024474     20701   18630.87
## 14        + Departure.Delay.in.Minutes -1    4.706415     20700   18626.17
## 15                  + Inflight.service -1    3.338926     20699   18622.83
## 16                    + Food.and.drink -1    2.262173     20698   18620.56
##         AIC
## 1  28421.03
## 2  22605.24
## 3  20629.96
## 4  19712.34
## 5  19384.32
## 6  19072.36
## 7  18942.01
## 8  18855.18
## 9  18801.04
## 10 18748.56
## 11 18712.38
## 12 18676.90
## 13 18656.87
## 14 18654.17
## 15 18652.83
## 16 18652.56

From the result, we can drop the following variables since keeping them do not reduce the AIC too much:

  • Gate.location
  • Departure.Delay.in.Minutes
  • Inflight.service
  • Food.and.drink
# Create the final model dropping the above 4 variables:
model_final <- glm(formula = satisfaction ~ Inflight.wifi.service + Departure.Arrival.time.convenient +
      Ease.of.Online.booking + Online.boarding + Seat.comfort + Inflight.entertainment +
      On.board.service + Leg.room.service + Baggage.handling + Cleanliness +
      Arrival.Delay.in.Minutes, family = binomial(logit), data = train_re)

summary(model_final)
## 
## Call:
## glm(formula = satisfaction ~ Inflight.wifi.service + Departure.Arrival.time.convenient + 
##     Ease.of.Online.booking + Online.boarding + Seat.comfort + 
##     Inflight.entertainment + On.board.service + Leg.room.service + 
##     Baggage.handling + Cleanliness + Arrival.Delay.in.Minutes, 
##     family = binomial(logit), data = train_re)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6264  -0.7030  -0.2548   0.6603   3.4835  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -6.9804384  0.1124365 -62.083  < 2e-16 ***
## Inflight.wifi.service              0.3316896  0.0226146  14.667  < 2e-16 ***
## Departure.Arrival.time.convenient -0.2575741  0.0138875 -18.547  < 2e-16 ***
## Ease.of.Online.booking            -0.1567413  0.0222052  -7.059 1.68e-12 ***
## Online.boarding                    0.8218477  0.0185445  44.318  < 2e-16 ***
## Seat.comfort                       0.1363618  0.0204004   6.684 2.32e-11 ***
## Inflight.entertainment             0.1582016  0.0230386   6.867 6.56e-12 ***
## On.board.service                   0.3315938  0.0182477  18.172  < 2e-16 ***
## Leg.room.service                   0.3589314  0.0157224  22.829  < 2e-16 ***
## Baggage.handling                   0.1188003  0.0194095   6.121 9.32e-10 ***
## Cleanliness                        0.1425549  0.0216627   6.581 4.68e-11 ***
## Arrival.Delay.in.Minutes          -0.0040447  0.0005157  -7.844 4.38e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28419  on 20713  degrees of freedom
## Residual deviance: 18653  on 20702  degrees of freedom
## AIC: 18677
## 
## Number of Fisher Scoring iterations: 5

Model Evaluation

We will use the model_final to predict and evaluate its performance on the test set.

# prediction using test_re data
test_re$fitted <- predict(model_final, test_re)
# calculate the predicted probabilites
test_re$prob <- exp(test_re$fitted)/(1+exp(test_re$fitted))

# confusion matrix and statistics
prob_cut_off = sum(test_re$satisfaction)/nrow(test_re)
test_re$predict <- as.numeric(test_re$prob > prob_cut_off)
xtab_num <- table(test_re$predict, test_re$satisfaction)

caret::confusionMatrix(xtab_num, positive = "1")
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 2332  405
##   1  601 1841
##                                           
##                Accuracy : 0.8058          
##                  95% CI : (0.7947, 0.8165)
##     No Information Rate : 0.5663          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6086          
##                                           
##  Mcnemar's Test P-Value : 7.845e-10       
##                                           
##             Sensitivity : 0.8197          
##             Specificity : 0.7951          
##          Pos Pred Value : 0.7539          
##          Neg Pred Value : 0.8520          
##              Prevalence : 0.4337          
##          Detection Rate : 0.3555          
##    Detection Prevalence : 0.4715          
##       Balanced Accuracy : 0.8074          
##                                           
##        'Positive' Class : 1               
## 
# AUC plot (# Both curves from LDA and Step-wise regression in the same graph)

rocplot_lda <- roc(predictions, labels) # from LDA
rocplot_lda <- roc(predictions, labels) # from LDA
rocplot_re <- roc(test_re$prob, as.factor(test_re$satisfaction)) # from Step-wise model

plot(rocplot_lda, col = 'red', lty = 1, main = 'ROC')
plot(rocplot_re, col = '#3F7ED5', lty = 2, add = TRUE)

legend(0.8, 0.15, legend=c("LDA", "Step-wise"),
       col=c("red", "#3F7ED5"), lty=1:2)

text(0.3, 0.70, paste0("LDA AUC = ", round(auc(rocplot_lda), 5)), col = "red", cex = 1.0)
text(0.3, 0.60, paste0("Step-wise Regression AUC = ", round(auc(rocplot_re), 5)), col = "#3F7ED5", cex = 1.0)

As we can see from the plot, the AUC value from the LDA and stepwise Rrgression model are similar (about 0.87) which indicates that both methods perform similarly well in terms of their ability to distinguish between the customers feeling satisfied and neutral or dissatisfied.

Correspondence Analysis

To further explore the relationship between the rated features and the different ratings, we carry out correspondence analysis on the 10 variables that were rated.

  • Inflight.wifi.service
  • Departure.Arrival.time.convenient
  • Ease.of.Online.booking
  • Food.and.drink
  • Online.boarding
  • Inflight.entertainment
  • On.board.service
  • Leg.room.service
  • Inflight.service
  • Cleanliness

Contigency Table

We will first create a contigency table with the 10 rated features as columns and the 6 ratings (0 - 5) as rows.

# create a contingency table
attach(df1_CA)
names = colnames(df1_CA)
c_matrix = matrix(0, nrow = 6, ncol = 10)

for (i in 1:10){
  for (j in 1:6){
    c_matrix[j,i] = table(df1_CA[names[i]])[j]
  }
}
detach(df1_CA)

# convert table from matrix to dataframe
ctable <- data.frame(c_matrix, check.names = FALSE)

# rename rows (rating from 0 - 5)
row.names(ctable) <- 0:5

# rename columns (10 rated features)
colnames(ctable) <- colnames(df1_CA)
# contigency table
ctable
##   Inflight.wifi.service Departure.Arrival.time.convenient
## 0                   812                              1374
## 1                  4469                              3899
## 2                  6481                              4336
## 3                  6298                              4399
## 4                  4965                              6312
## 5                  2868                              5573
##   Ease.of.Online.booking Food.and.drink Online.boarding Inflight.entertainment
## 0                   1193             25             651                      4
## 1                   4342           3210            2558                   3193
## 2                   6021           5375            4417                   4318
## 3                   5927           5474            5296                   4725
## 4                   4854           6183            7682                   7347
## 5                   3556           5626            5289                   6306
##   On.board.service Leg.room.service Inflight.service Cleanliness
## 0                2              126                2           2
## 1             2906             2536             1775        3404
## 2             3658             5000             2838        3968
## 3             5690             4940             5005        6046
## 4             7814             7075             9352        6771
## 5             5823             6216             6921        5702

A quick chi-square test conducted on the contingency table presented a p-value of approximately 0, indicating that the row variables (ratings) has statistically significant associations with the column variables (rated features).

# chi-square test
chisq <- chisq.test(ctable)
chisq
## 
##  Pearson's Chi-squared test
## 
## data:  ctable
## X-squared = 16298, df = 45, p-value < 2.2e-16

Correspondence Analysis

# CA
res.ca.air <- CA(ctable, graph = FALSE)

summary(res.ca.air)
## 
## Call:
## CA(X = ctable, graph = FALSE) 
## 
## The chi square of independence between the two variables is equal to 16297.54 (p-value =  0 ).
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5
## Variance               0.047   0.011   0.002   0.002   0.000
## % of var.             74.523  18.147   3.904   2.944   0.482
## Cumulative % of var.  74.523  92.670  96.574  99.518 100.000
## 
## Rows
##                                     Iner*1000    Dim.1    ctr   cos2    Dim.2
## 0                                 |    24.460 |  1.056 38.479  0.738 |  0.627
## 1                                 |     7.662 |  0.224 13.380  0.819 | -0.050
## 2                                 |     9.049 |  0.186 13.284  0.689 | -0.099
## 3                                 |     2.457 |  0.040  0.696  0.133 | -0.074
## 4                                 |     9.297 | -0.171 16.526  0.834 |  0.062
## 5                                 |    10.016 | -0.199 17.635  0.826 |  0.062
##                                      ctr   cos2    Dim.3    ctr   cos2  
## 0                                 55.770  0.260 | -0.031  0.639  0.001 |
## 1                                  2.741  0.041 |  0.022  2.459  0.008 |
## 2                                 15.488  0.195 |  0.043 13.188  0.036 |
## 3                                 10.039  0.467 | -0.064 35.085  0.351 |
## 4                                  8.949  0.110 | -0.037 14.481  0.038 |
## 5                                  7.013  0.080 |  0.063 34.148  0.084 |
## 
## Columns
##                                     Iner*1000    Dim.1    ctr   cos2    Dim.2
## Inflight.wifi.service             |    13.211 |  0.338 24.354  0.865 | -0.123
## Departure.Arrival.time.convenient |     9.887 |  0.199  8.438  0.400 |  0.234
## Ease.of.Online.booking            |    13.423 |  0.366 28.506  0.996 |  0.005
## Food.and.drink                    |     2.167 | -0.038  0.315  0.068 | -0.127
## Online.boarding                   |     1.493 | -0.014  0.039  0.012 |  0.092
## Inflight.entertainment            |     2.736 | -0.143  4.375  0.750 | -0.032
## On.board.service                  |     3.278 | -0.167  5.970  0.854 | -0.030
## Leg.room.service                  |     2.136 | -0.111  2.620  0.575 | -0.031
## Inflight.service                  |    12.201 | -0.331 23.317  0.896 |  0.098
## Cleanliness                       |     2.408 | -0.098  2.065  0.402 | -0.086
##                                      ctr   cos2    Dim.3    ctr   cos2  
## Inflight.wifi.service             13.166  0.114 | -0.045  8.085  0.015 |
## Departure.Arrival.time.convenient 47.945  0.554 |  0.052 10.818  0.027 |
## Ease.of.Online.booking             0.026  0.000 | -0.015  0.965  0.002 |
## Food.and.drink                    14.165  0.747 |  0.059 14.295  0.162 |
## Online.boarding                    7.385  0.565 | -0.049  9.957  0.164 |
## Inflight.entertainment             0.906  0.038 |  0.062 15.850  0.142 |
## On.board.service                   0.787  0.027 | -0.050 10.180  0.076 |
## Leg.room.service                   0.842  0.045 |  0.063 16.312  0.188 |
## Inflight.service                   8.328  0.078 | -0.052 10.948  0.022 |
## Cleanliness                        6.451  0.306 | -0.025  2.589  0.026 |

Biplot

Dimension 1 (74.48%) and dimension 2 (18.23%) add up to nearly 93%. This means the two dimensions capture most of the information in the dataset.

Overall, most ranked features are grouped together with the exception of Departure.Arrival.time.convenient. This could mean that Departure.arrival.time.convenient is less strongly associated with the other features.

Whereas Inflight.wifi.service and Ease.of.Online.booking have their own little cluster. Departure.Arrival.time.convenient has little similarity with the rest of the variables. The variables that are the least dissimilar are Ease.of.Online.booking and Online.boarding.

Rating 0 is very far away from all rated features, suggesting that rating 0 is highly discriminated from the other ratings and rated features. This is reasonable, since 0 means the services were not available on the flights for passengers to rate.

There is a division between rating 1 - 3 in the upper right quadrant of the plot (positive association between ratings and rated features) and rating 4 - 5 in the lower left quadrant (negative association between ratings and rated features). This means that the relationship between ratings and rated features is not consistently positive or negative, and may indicate subgroups within the ratings or features.

Now let’s examine the relationship between the column variables which are the services, and the row variables, which are the ratings.

We can see that the rating of 2 and Inflight.wifi.service are both distant from the origin, meaning Inflight.wifi.service has an association with a low rating of 2. So Inflight.wifi.service has mostly been not great.

Similarly, Ease.of.Online.booking and rating 1 are far from the origin, and the angle in between are small. So it’s not easy to book online. Inflight.service has mostly been great, whereas most of the service variables in the clusters are somewhere between neutral to positive.

# biplot
fviz_ca_biplot(res.ca.air, repel = TRUE)

Contribution of Row and Column Variables

Dimension 1

In dimension 1, rating 0 is the largest contributor to thee total variance amongst the row variables. However, it should be noted that 0 in our dataset does not indicate the lowest level of satisfaction, but that particular services were not offered on the flights and therefore, not applicable to rating. Amogst the column variables, Ease.of.Online.booking, Inflight.wifi.service and Inflight.service are the largest contributors.

# contribution of axis 1 - row variables
fviz_contrib(res.ca.air, choice="row", axes=1)

# contribution of axis 1 - column variables
fviz_contrib(res.ca.air, choice="col", axes=1)

Dimension 2

In dimension 2, once again, rating 0 captures the largest amount of variance, along with Departure.Arrival.time.convenient.

# contribution of axis 2 - row variables
fviz_contrib(res.ca.air, choice="row", axes=2)

# contribution of axis 2 - column variables
fviz_contrib(res.ca.air, choice="col", axes=2)

Further Analysis

Logistic Regression

To further explore the prediction power of the statistical tools we have utilized so far, we will conduct a comparison between the following models:

  • LDA model (in previous section): consists of 10 ranked variables and the response variable.
  • Step-wise regression model (in previous section): consists of numerical and ranked variables (note that Food.and.drink and Inflight.service are excluded), and the response variable.
  • Logistic regression model with only PCs: consists of only the principal components and the response variable.
  • Logistic regression model with all numerical and categorical variables from the original dataset (note that id is excluded, and the response variable.

Model with Principal Components

# convert PCs into a dataframe
df_pca <- as.data.frame(airline.pca$x)

# add the response variable to the dataframe
df_pca["satisfaction"] <- df$satisfaction

# numerically encode the response variable
df_pca$satisfaction <- ifelse(df_pca$satisfaction=="satisfied", 1, 0)

# a dataframe of only PCs and the response variable
head(df_pca)
##          PC1        PC2         PC3        PC4         PC5        PC6
## 1 -2.7088943  1.0833865  1.31527211 -1.4199776  0.13475759  0.2693778
## 2 -1.6103765 -2.4837470 -0.44022622  0.3543749 -0.86799607  0.1264887
## 3  2.8409125 -0.4185596 -0.05629095  0.7029403 -0.01498055 -0.5642194
## 4  3.4419198 -2.6545624 -3.19878268  0.6617597 -0.93195332 -0.2853510
## 5  2.1482181  0.2402589 -1.72739473  0.3664105  0.94506445  0.4722932
## 6 -0.9118258 -0.2400422 -2.51014836  0.6044318 -0.51284227  0.2438900
##          PC7        PC8         PC9         PC10        PC11        PC12
## 1 -0.2329317  0.0166523 -0.21475732 -0.005031209  0.74897571 -1.03958767
## 2  0.1124948  0.1234445 -0.01161041 -0.062345268 -0.53491978 -0.03727264
## 3 -1.3645888 -1.8495066 -0.91635691 -0.877084449  0.09403787 -0.13529958
## 4  0.4647119 -1.1738962  0.59454692  0.028274160 -0.27491805 -0.56730652
## 5 -0.7558237  0.7884415 -0.88083144 -0.179415316  0.04975187 -0.60585078
## 6  0.3087657 -0.2262218 -1.90000231  0.326358145 -0.52122419 -1.33555626
##          PC13        PC14         PC15 satisfaction
## 1  0.90557801  0.12487081 -0.124477572            1
## 2 -1.33761503  0.50009863  0.003429964            1
## 3  0.02900768  0.09296906 -0.019406409            0
## 4  0.13917492  0.61890248  0.096009497            1
## 5 -1.13850144  0.53133554  0.360029351            1
## 6  0.30752308 -0.70375654 -0.022490007            1
# train/test split (80/20)
sample_pca <- sample.int(n = nrow(df_pca), size = floor(.8*nrow(df_pca)), replace = F) # 80% of dataset
train_pca <- df_pca[sample, ]
test_pca <- df_pca[-sample, ]
# fit model on train set
model_train_pca <- glm(satisfaction ~ ., family = "binomial"(link="logit"), data = train_pca)

# use test set to predict
test_pca$fitted_pca <- predict(model_train_pca, test_pca)
# calculate the predicted probabilities
test_pca$prob_pca <- exp(test_pca$fitted_pca)/(1+exp(test_pca$fitted_pca))

# the classification report: sensitivity, specificity and accuracy
prob_cut_off_pca = sum(test_pca$satisfaction)/nrow(test_pca)
test_pca$predict <- as.numeric(test_pca$prob_pca > prob_cut_off_pca)
xtab_pca <- table(test_pca$predict, test_pca$satisfaction)

caret::confusionMatrix(xtab_pca, positive = "1")
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 2325  408
##   1  608 1838
##                                           
##                Accuracy : 0.8038          
##                  95% CI : (0.7927, 0.8146)
##     No Information Rate : 0.5663          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6047          
##                                           
##  Mcnemar's Test P-Value : 4.287e-10       
##                                           
##             Sensitivity : 0.8183          
##             Specificity : 0.7927          
##          Pos Pred Value : 0.7514          
##          Neg Pred Value : 0.8507          
##              Prevalence : 0.4337          
##          Detection Rate : 0.3549          
##    Detection Prevalence : 0.4723          
##       Balanced Accuracy : 0.8055          
##                                           
##        'Positive' Class : 1               
## 

Model with all Numerical and Categorical Variables

# convert categorical variables into numerical
all_df$satisfaction <- ifelse(all_df$satisfaction=="satisfied", 1, 0)
all_df$Class <- as.numeric(as.factor(all_df$Class))
all_df$Customer.Type <- ifelse(all_df$Customer.Type =="Loyal Customer", 1, 0)
all_df$Gender <- ifelse(all_df$Gender=="Male", 1, 0)
all_df$Type.of.Travel <- ifelse(all_df$Type.of.Travel =="Personal Travel", 1, 0)

head(all_df,2)
##   Gender Customer.Type Age Type.of.Travel Class Flight.Distance
## 1      0             1  52              0     2             160
## 2      0             1  36              0     1            2863
##   Inflight.wifi.service Departure.Arrival.time.convenient
## 1                     5                                 4
## 2                     1                                 1
##   Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 1                      3             4              3               4
## 2                      3             1              5               4
##   Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 1            3                      5                5                5
## 2            5                      4                4                4
##   Baggage.handling Checkin.service Inflight.service Cleanliness
## 1                5               2                5           5
## 2                4               3                4           5
##   Departure.Delay.in.Minutes Arrival.Delay.in.Minutes satisfaction
## 1                         50                       44            1
## 2                          0                        0            1
# train/test split (80/20)
sample_cat <- sample.int(n = nrow(all_df), size = floor(.8*nrow(all_df)), replace = F) # 80% of dataset
train_cat <- all_df[sample_cat, ]
test_cat <- all_df[-sample_cat, ]
# fit model on train set
model_cat <- glm(satisfaction ~ ., family = "binomial"(link="logit"), data = train_cat)

# use test set to predict
test_cat$fitted <- predict(model_cat, test_cat)
# calculate the predicted probabilities
test_cat$prob <- exp(test_cat$fitted)/(1+exp(test_cat$fitted))

# the classification report: sensitivity, specificity and accuracy
prob_cut_off_cat = sum(test_cat$satisfaction)/nrow(test_cat)
test_cat$predict <- as.numeric(test_cat$prob > prob_cut_off_cat)
xtab_cat <- table(test_cat$predict, test_cat$satisfaction)

caret::confusionMatrix(xtab_cat, positive = "1")
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 2533  321
##   1  360 1965
##                                          
##                Accuracy : 0.8685         
##                  95% CI : (0.859, 0.8776)
##     No Information Rate : 0.5586         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.7338         
##                                          
##  Mcnemar's Test P-Value : 0.1453         
##                                          
##             Sensitivity : 0.8596         
##             Specificity : 0.8756         
##          Pos Pred Value : 0.8452         
##          Neg Pred Value : 0.8875         
##              Prevalence : 0.4414         
##          Detection Rate : 0.3794         
##    Detection Prevalence : 0.4489         
##       Balanced Accuracy : 0.8676         
##                                          
##        'Positive' Class : 1              
## 

When comparing the prediction results from the LDA model, the logistic regression model with only principal components and the logistic regression model with all numerical and categorical variables, we can see that all three models perform similarly well. Specifically, they achieved an accuracy of 80.4 - 86.8%, sensitivity ranging from 78.2 - 84.6% and specificity ranging from 79.3 - 88.5%. The logistic regression model with all numerical and categorical variables has the highest accuracy, sensitivity and specificity, although the improvement over other models is not significant enough to warrant further consideration.

Reliability of Factor Analysis

We can use Cronbach’s alpha coefficient to assess internal consistency, which measures the degree to which items in a factor are correlated with each other. Cronbach’s alpha ranges from 0 to 1, where higher values indicate greater internal consistency. A coefficient higher than 0.6 is generally considered acceptable for most research purposes.

Except for the group “Others”, the value of Cronbach’s alpha coefficients are higher than 0.6, indicating a moderate to high level of internal consistency between the items in each group.

Since the group “Others” consists of only one item, Online.Boarding, Cronbach’s alpha coefficient will not be applicable.

# group the data
In_flight_experience <- subset(df, select = c(Inflight.entertainment, Cleanliness, Seat.comfort, Food.and.drink))
Convenience <- subset(df, select = c(Ease.of.Online.booking, Inflight.wifi.service, Gate.location, Departure.Arrival.time.convenient))
Service <- subset(df, select = c(Baggage.handling, Inflight.service, On.board.service, Leg.room.service))
Punctuation <- subset(df, select = c(Departure.Delay.in.Minutes, Arrival.Delay.in.Minutes))
# Cronbach's alpha coefficient
cronbach.alpha(In_flight_experience)
## 
## Cronbach's alpha for the 'In_flight_experience' data-set
## 
## Items: 4
## Sample units: 25893
## alpha: 0.879
cronbach.alpha(Convenience)
## 
## Cronbach's alpha for the 'Convenience' data-set
## 
## Items: 4
## Sample units: 25893
## alpha: 0.772
cronbach.alpha(Service)
## 
## Cronbach's alpha for the 'Service' data-set
## 
## Items: 4
## Sample units: 25893
## alpha: 0.777
cronbach.alpha(Punctuation)
## 
## Cronbach's alpha for the 'Punctuation' data-set
## 
## Items: 2
## Sample units: 25893
## alpha: 0.982

Conclusion

This project employed Principal Component Analysis (PCA) and Factor Analysis (FA) to examine the fundamental structure of the US Airline Passengers dataset and detect features that are strongly linked to the passengers’ satisfaction level. The results showed that while there are some differences in the grouping between PCA and FA, interval features can generally be categorized into 5 groups:

  • In-flight experience
  • Convenience
  • Service
  • Punctuation
  • Technology (PCA) / Others (FA)

To assess the reliability of FA, Cronbach’s alpha coefficient for each group created from the FA procedure was calculated. The coefficient values of higher than 0.6 confirmed a relatively high internal consistency between items in each group.

Linear Discriminant Analysis (LDA) was also utilized on the basis of a train/test dataset split in search of linear combinations of variables that could maximize the separation between groups and afterward, prediction of passengers’ satisfaction level. The outcomes indicated an accuracy of 81.35%, and the AUC value of 0.87 confirmed the relatively good performance of the prediction model.

A model built on the results of step-wise regression was also studied and compared with the LDA model. In terms of accuracy, the step-wise regression model performed as well as the LDA model, achieving an accuracy of 80.38% and the same AUC value of 0.87.

To delve deeper into the relationship between the rated features and the different ratings, correspondence analysis was conducted. Approximately 92.71% of the total variance are captured in the first 2 dimensions, with rating 0 (service not applicable for rating) being the largest contributor in both dimensions.

The performance of the LDA model, the step-wise regression model, a logistic regression model with only the principal components and a logistic regression model with all original variables are comparable.

References

SEE, J. C. G., & PORIO, E. E. (2015). Assessing Social Vulnerability to Flooding in Metro Manila Using Principal Component Analysis. Philippine Sociological Review, 63, 53–80. http://www.jstor.org/stable/24717187